This procedure performs a two-group comparison of all genes in a transcriptome or similar data set. It accepts either read count data (RNA-seq) or normalized/log-transformed signal intensity (microarray or RNA-seq) as input. Gene-level analysis is performed by applying a selected statistical test to compare the group means of each gene.
Transcriptome in immune cells of control-patient samples
RNA-seq data were generated from 3 types of immune cells from 3 controls and 3 patients. Raw data were processed to obtain gene-level read counts.
The analysis results below were based upon the following:
This section analyzes the samples by summarizing their global expression patterns, through descriptive statistics, unsupervised clustering, and sample-sample correlation.
Distribution of average expression level and between-sample variance.
Figure 1.
By comparing gene expression patterns of different samples, observations can be made about the sample similarity. This section compares samples via data distribution, hierarchical clustering, and principal components analysis, which can potentially be used to identify outliers and confounding variables.
Figure 2. Each boxplot summarizes the expression measurements of all genes in one sample. Under the assumption that all samples have approximately the same global distribution of gene expression measurements, all boxes should look similar regardless of which group they belong to. This assumption does not always hold, however. For example, normal cells and cancer cells could have dramatically different global patterns of gene expression. The notch of each box marks the median, the lower and upper sides of the box represent the first and third quartiles, and the individual data points beyond the whiskers are outliers (more than 1.5 times the interquartile range beyond the box).
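As a minimal sketch of how the elements of one box are derived, the R snippet below applies `boxplot.stats` from base R to simulated values. The data and all variable names are illustrative assumptions, not the measurements used in this report.

```r
# Simulated log2 expression values for one hypothetical sample (not report data)
set.seed(1)
expr <- c(rnorm(1000, mean = 8, sd = 1.5), 20)  # 20 is an artificial extreme value

# boxplot.stats() returns the five numbers drawn by boxplot():
# lower whisker, lower hinge, median, upper hinge, upper whisker
s <- boxplot.stats(expr, coef = 1.5)
hinges <- s$stats[c(2, 4)]   # box edges, approximately the first and third quartiles
iqr <- diff(hinges)          # height of the box

# Points more than 1.5 x IQR beyond the box edges are reported as outliers
upper_fence <- hinges[2] + 1.5 * iqr
lower_fence <- hinges[1] - 1.5 * iqr
outliers <- s$out
```

Comparing `s$stats` across samples gives a numeric counterpart to checking by eye whether the boxes look similar.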
Figure 3. This is an unsupervised clustering of samples using all genes in the data set. On the clustering tree, the vertical location of the lowest common node of any two samples represents their similarity (lower = more similar). Since the sample grouping information is not used, the splitting of samples into two sub-trees suggests that these samples belong to different groups due to a known or unknown factor. Unexpected splitting often suggests outliers or confounding variables.
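The idea can be sketched with `hclust` from the stats package on simulated data. The expression matrix, the 1 - Pearson correlation distance, and the average linkage below are assumptions chosen for illustration; the report does not state which distance or linkage was used.

```r
# Simulated expression matrix: 200 genes x 6 samples (hypothetical, not report data)
set.seed(1)
group <- rep(c("Control", "Patient"), each = 3)
expr <- matrix(rnorm(200 * 6), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0(group, 1:3)))

# Shift half of the genes in the Patient samples to create a group effect
expr[1:100, group == "Patient"] <- expr[1:100, group == "Patient"] + 3

# Distance between samples: 1 - Pearson correlation (an assumed choice)
d <- as.dist(1 - cor(expr))
tree <- hclust(d, method = "average")

# Cutting the tree into two sub-trees; the grouping itself was never used
clusters <- cutree(tree, k = 2)
# plot(tree)  # draws the dendrogram; node height = dissimilarity
```

In this simulated case the two sub-trees coincide with the two groups; in real data, a split that does not match the expected groups is what would prompt a check for outliers or confounders.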
Figure 4. Principal Components Analysis (PCA) is also an unsupervised analysis; it converts a large number of correlated variables (genes) into a smaller set of uncorrelated variables called principal components (PCs). Each principal component accounts for a certain percentage of the total variability of a data set, so the PCs can be ordered by their percentages. This figure plots the top two PCs on the two axes. In general, samples closer to each other have more similar gene expression patterns. PCA can be used to identify sample features, such as age, disease, and treatment, that are associated with one or two PCs. It can then be inferred that these features account for part of the total variability in the data set.
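A minimal sketch of this computation with `prcomp` from the stats package is shown below. The simulated matrix and the choice to center but not scale the genes are assumptions for illustration only.

```r
# Simulated expression matrix: 200 genes x 6 samples (hypothetical, not report data)
set.seed(1)
group <- rep(c("Control", "Patient"), each = 3)
expr <- matrix(rnorm(200 * 6), nrow = 200)
expr[1:100, group == "Patient"] <- expr[1:100, group == "Patient"] + 3

# prcomp() expects observations in rows, so transpose to samples x genes
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Percentage of total variance explained by each PC, in decreasing order
pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Coordinates of the samples on the top two PCs (the axes of a PCA plot)
scores <- pca$x[, 1:2]
# plot(scores, col = ifelse(group == "Patient", "red", "blue"))
```

Here the simulated group effect dominates the first PC, so the sign of `scores[, 1]` separates the two groups.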
Figure 5. The differential expression of all 10162 genes can be visualized in different ways:
Table 1. Number of top DEGs selected via different cutoffs of FDR. FDRs are calculated using the Benjamini-Hochberg method (Benjamini and Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B 57, 289–300. 1995).
| FDR | Higher_in_Patient | Lower_in_Patient | Total |
|---|---|---|---|
| 0.01 | 1250 | 1743 | 2993 |
| 0.02 | 1453 | 1957 | 3410 |
| 0.05 | 1868 | 2321 | 4189 |
| 0.10 | 2240 | 2659 | 4899 |
| 0.15 | 2560 | 2920 | 5480 |
| 0.20 | 2790 | 3092 | 5882 |
| 0.25 | 2981 | 3272 | 6253 |
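A table of this shape can be sketched in base R with `p.adjust(method = "BH")`. Everything else in the snippet is an assumption for illustration: the simulated data, the per-gene t-test (a stand-in for whichever test the report applied), and the `<=` comparison against each cutoff.

```r
# Simulated log2 expression: 1000 genes x 6 samples (hypothetical, not report data)
set.seed(1)
n_genes <- 1000
group <- rep(c("Control", "Patient"), each = 3)
expr <- matrix(rnorm(n_genes * 6, mean = 8), nrow = n_genes)
expr[1:100, group == "Patient"] <- expr[1:100, group == "Patient"] + 4    # higher in Patient
expr[101:200, group == "Patient"] <- expr[101:200, group == "Patient"] - 4  # lower in Patient

# Per-gene two-sample t-test (an assumed stand-in for the report's test)
pval <- apply(expr, 1, function(x) {
  t.test(x[group == "Patient"], x[group == "Control"])$p.value
})
log2fc <- rowMeans(expr[, group == "Patient"]) - rowMeans(expr[, group == "Control"])

# Benjamini-Hochberg adjustment of the p values
fdr <- p.adjust(pval, method = "BH")

# Count DEGs in each direction at each FDR cutoff
cutoffs <- c(0.01, 0.02, 0.05, 0.10, 0.15, 0.20, 0.25)
deg_counts <- data.frame(
  FDR = cutoffs,
  Higher_in_Patient = sapply(cutoffs, function(cut) sum(fdr <= cut & log2fc > 0)),
  Lower_in_Patient  = sapply(cutoffs, function(cut) sum(fdr <= cut & log2fc < 0))
)
deg_counts$Total <- deg_counts$Higher_in_Patient + deg_counts$Lower_in_Patient
```

Because each cutoff is less strict than the previous one, the counts in every column can only stay the same or grow down the table, as they do in Table 1.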
Figure 6. Top-ranked genes with increased (left) and decreased (right) expression in Patient. Click here to view differential expression of all genes.
Figure 7. Heatmap of the top 100 DEGs with higher expression (red) and the top 100 DEGs with lower expression (yellow) in Patient. Each row represents a DEG, whose expression measurements are normalized across samples. Samples are clustered by these genes and the columns are colored (blue = Control and red = Patient).
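One common way to normalize each gene across samples is a row-wise z-score, which is also what `heatmap` from the stats package applies by default (`scale = "row"`). The sketch below assumes this form of normalization and uses simulated values; the report does not state the exact method.

```r
# Simulated expression of 20 hypothetical DEGs in 6 samples (not report data)
set.seed(1)
group <- rep(c("Control", "Patient"), each = 3)
expr <- matrix(rnorm(20 * 6, mean = 8), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0(group, 1:3)))

# Row-wise z-score: scale() works on columns, so transpose before and after
z <- t(scale(t(expr)))

# Column colors matching the figure legend (blue = Control, red = Patient)
col_colors <- ifelse(group == "Patient", "red", "blue")
# heatmap(z, scale = "none", ColSideColors = col_colors)  # draws the heatmap
```

After this step every gene has mean 0 and standard deviation 1 across samples, so genes with very different absolute expression levels share one color scale.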
Click links below to view table or download files:
Check out the RoCA home page for more information.
The terms used to describe differential expression can be confusing. In this report, fold change refers to the ratio of two group means in their unlogged form. A fold change of 2.0 means the average of the second group is twice the average of the first group; similarly, a fold change of 0.5 means the average is reduced by half. Log2(fold change) is the log2 transformation of the fold change. The table below gives a few examples of the conversion between these 2 variables. Log2(fold change) is more suitable for statistical analysis since it is symmetric around 0: an increase and a decrease of the same magnitude differ only in sign.
Supplemental Table 1. Fold Change vs. Log2(Fold Change) vs. Percentage Change
| Fold change | Log2(fold change) | Percentage change (%) |
|---|---|---|
| 0.125 | -3.000 | -87.500 |
| 0.250 | -2.000 | -75.000 |
| 0.500 | -1.000 | -50.000 |
| 0.667 | -0.585 | -33.333 |
| 0.800 | -0.322 | -20.000 |
| 1.000 | 0.000 | 0.000 |
| 1.250 | 0.322 | 25.000 |
| 1.500 | 0.585 | 50.000 |
| 2.000 | 1.000 | 100.000 |
| 4.000 | 2.000 | 300.000 |
| 8.000 | 3.000 | 700.000 |
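The conversions in this table follow directly from the definitions above and can be reproduced with a few lines of base R. The variable names are illustrative.

```r
# Fold changes listed in Supplemental Table 1 (0.667 in the table is 2/3 rounded)
fold_change <- c(0.125, 0.25, 0.5, 2/3, 0.8, 1, 1.25, 1.5, 2, 4, 8)

conversion <- data.frame(
  fold_change       = fold_change,
  log2_fold_change  = log2(fold_change),
  percentage_change = 100 * (fold_change - 1)
)
print(round(conversion, 3))

# Symmetry around 0: reciprocal fold changes have opposite log2 values
symmetric <- all.equal(log2(1 / fold_change), -log2(fold_change))
```

Note that percentage change is not symmetric: halving is -50% while doubling is +100%, whereas the corresponding log2 values are -1 and 1.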
The key steps of statistical analysis in this report use existing R/Bioconductor packages and functions.
Supplemental Table 2. R/Bioconductor key functions
| Task | R package | R function |
|---|---|---|
| Hierarchical clustering | stats | hclust |
| PCA | stats | prcomp |
| Differential expression | DEGandMore | DeWrapper |
| Heatmap | stats | heatmap |
| Write JavaScript HTML datatables | awsomics | CreateDatatable |
| Write data to Excel | xlsx | createWorkbook |
To reproduce this report:
Find the data analysis template you want to use and an example of its paired YAML file here, then download the YAML example to your working directory.
To generate a new report using your own input data and parameters, edit the following items in the YAML file:
- **output** : where you want to put the output files
- **home** : the URL if you have a home page for your project
- **analyst** : your name
- **description** : background information about your project, analysis, etc.
- **input** : location of your input data; read the instructions for preparing them
- **parameter** : parameters for this analysis; read the instructions on how to prepare input data
if (!require(devtools)) { install.packages('devtools'); require(devtools); }
if (!require(RCurl)) { install.packages('RCurl'); require(RCurl); }
if (!require(RoCA)) { install_github('zhezhangsh/RoCAR'); require(RoCA); }
CreateReport(filename.yaml); # filename.yaml is the YAML file you just downloaded and edited for your analysis
If no errors are reported, go to the output folder and open the index.html file to view the report.
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X Yosemite 10.10.5
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] DESeq2_1.14.1 SummarizedExperiment_1.4.0
## [3] Biobase_2.34.0 GenomicRanges_1.26.4
## [5] GenomeInfoDb_1.10.3 IRanges_2.8.2
## [7] S4Vectors_0.12.2 BiocGenerics_0.20.0
## [9] xlsx_0.5.7 xlsxjars_0.6.1
## [11] rJava_0.9-9 DEGandMore_0.0.0.9000
## [13] snow_0.4-2 rchive_0.0.0.9000
## [15] colorspace_1.3-2 gplots_3.0.1
## [17] MASS_7.3-45 htmlwidgets_0.9
## [19] DT_0.2 kableExtra_0.9.0
## [21] awsomics_0.0.0.9000 yaml_2.1.16
## [23] rmarkdown_1.10.3 knitr_1.18
## [25] RoCA_0.0.0.9000 RCurl_1.95-4.9
## [27] bitops_1.0-6 devtools_1.13.4
##
## loaded via a namespace (and not attached):
## [1] bit64_0.9-7 RColorBrewer_1.1-2 httr_1.3.1
## [4] rprojroot_1.3-2 tools_3.3.3 backports_1.1.2
## [7] R6_2.2.2 rpart_4.1-10 KernSmooth_2.23-15
## [10] Hmisc_4.1-0 DBI_0.7 lazyeval_0.2.1
## [13] nnet_7.3-12 withr_2.1.1 gridExtra_2.3
## [16] bit_1.1-12 rvest_0.3.2 htmlTable_1.11.1
## [19] xml2_1.1.1 caTools_1.17.1 scales_0.5.0
## [22] checkmate_1.8.5 readr_1.1.1 genefilter_1.56.0
## [25] stringr_1.2.0 digest_0.6.13 foreign_0.8-67
## [28] XVector_0.14.1 base64enc_0.1-3 pkgconfig_2.0.1
## [31] htmltools_0.3.6 highr_0.6 rlang_0.1.6
## [34] rstudioapi_0.7 RSQLite_2.0 jsonlite_1.5
## [37] BiocParallel_1.8.2 gtools_3.5.0 acepack_1.4.1
## [40] magrittr_1.5 Formula_1.2-2 Matrix_1.2-8
## [43] Rcpp_0.12.14 munsell_0.4.3 stringi_1.1.6
## [46] zlibbioc_1.20.0 plyr_1.8.4 grid_3.3.3
## [49] blob_1.1.0 gdata_2.18.0 lattice_0.20-34
## [52] splines_3.3.3 annotate_1.52.1 hms_0.4.0
## [55] locfit_1.5-9.1 pillar_1.1.0 geneplotter_1.52.0
## [58] XML_3.98-1.9 evaluate_0.10.1 latticeExtra_0.6-28
## [61] data.table_1.10.4-3 gtable_0.2.0 ggplot2_2.2.1
## [64] xtable_1.8-2 survival_2.40-1 viridisLite_0.2.0
## [67] tibble_1.4.2 AnnotationDbi_1.36.2 memoise_1.1.0
## [70] cluster_2.0.5